AITopics

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Synthesis (0.42)
Information Technology > Artificial Intelligence > Machine Learning > Transfer Learning (0.39)

Neural Information Processing Systems Feb-13-2026, 01:16:25 GMT

Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis

Ye Jia, Yu Zhang, Ron Weiss, Quan Wang, Jonathan Shen, Fei Ren, Zhifeng Chen, Patrick Nguyen, Ruoming Pang, Ignacio Lopez Moreno, Yonghui Wu

Neural Information Processing Systems http://nips.cc/

speaker encoder, speech, utterance, (15 more...)

Country:

North America > United States (0.04)
North America > Canada > Quebec > Montreal (0.04)

Industry: Information Technology (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Speech > Acoustic Processing (0.66)

Neural Information Processing Systems Feb-12-2026, 17:28:44 GMT

Neural Voice Cloning with a Few Samples

Sercan Arik, Jitong Chen, Kainan Peng, Wei Ping, Yanqi Zhou

Neural Information Processing Systems http://nips.cc/

adaptation, generative model, speaker adaptation, (13 more...)

Country:

North America > United States > California > Santa Clara County > Sunnyvale (0.04)
North America > Canada > Quebec > Montreal (0.04)
Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
Asia (0.04)

Industry: Information Technology > Security & Privacy (0.53)

Technology:

Information Technology > Artificial Intelligence > Speech (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

Neural Information Processing Systems Nov-20-2025, 22:22:08 GMT

Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis

We describe a neural network-based system for text-to-speech (TTS) synthesis that is able to generate speech audio in the voice of many different speakers, including those unseen during training. Our system consists of three independently trained components: (1) a speaker encoder network, trained on a speaker verification task using an independent dataset of noisy speech from thousands of speakers without transcripts, to generate a fixed-dimensional embedding vector from seconds of reference speech from a target speaker; (2) a sequence-to-sequence synthesis network based on Tacotron 2, which generates a mel spectrogram from text, conditioned on the speaker embedding; (3) an auto-regressive WaveNet-based vocoder that converts the mel spectrogram into a sequence of time domain waveform samples. We demonstrate that the proposed model is able to transfer the knowledge of speaker variability learned by the discriminatively-trained speaker encoder to the new task, and is able to synthesize natural speech from speakers that were not seen during training. We quantify the importance of training the speaker encoder on a large and diverse speaker set in order to obtain the best generalization performance. Finally, we show that randomly sampled speaker embeddings can be used to synthesize speech in the voice of novel speakers dissimilar from those used in training, indicating that the model has learned a high quality speaker representation.
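The data flow among the three components named in this abstract can be sketched as follows. This is only a toy illustration under stated assumptions: the functions `speaker_encoder`, `synthesizer`, `vocoder` and `clone_voice`, and the constants `EMBED_DIM`, `N_MELS` and `HOP`, are hypothetical stand-ins invented here, not the authors' networks or settings. The sketch shows only that reference audio of any length becomes a fixed-size embedding, that the text-to-mel step is conditioned on that embedding, and that the vocoder turns mel frames into waveform samples.

```python
import math
import random

EMBED_DIM = 4  # hypothetical; the abstract only says "fixed-dimensional"
N_MELS = 3     # hypothetical number of mel channels
HOP = 2        # hypothetical number of waveform samples per mel frame


def speaker_encoder(reference_audio):
    """Stand-in for component (1): variable-length audio -> fixed-size, unit-norm vector."""
    chunks = [reference_audio[i::EMBED_DIM] for i in range(EMBED_DIM)]
    raw = [sum(c) / max(len(c), 1) for c in chunks]
    norm = math.sqrt(sum(x * x for x in raw)) or 1.0
    return [x / norm for x in raw]


def synthesizer(text, speaker_embedding):
    """Stand-in for component (2): one mel frame per character, shifted by the embedding."""
    bias = sum(speaker_embedding)
    return [[(ord(ch) % 7) * 0.1 + bias + m for m in range(N_MELS)] for ch in text]


def vocoder(mel):
    """Stand-in for component (3): expand every mel frame into HOP waveform samples."""
    return [math.tanh(sum(frame) / N_MELS) for frame in mel for _ in range(HOP)]


def clone_voice(text, reference_audio):
    """Chain the three independently defined stages."""
    embedding = speaker_encoder(reference_audio)
    mel = synthesizer(text, embedding)
    return embedding, mel, vocoder(mel)


random.seed(0)
short_ref = [random.uniform(-1, 1) for _ in range(50)]   # toy "reference speech"
long_ref = [random.uniform(-1, 1) for _ in range(500)]
emb_short, mel, wav = clone_voice("hello", short_ref)
emb_long, _, _ = clone_voice("hello", long_ref)
```

Both reference clips, although of different lengths, yield embeddings of the same size, which is the property that lets a separately trained synthesizer consume them.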

multispeaker text-to-speech synthesis, speaker verification, transfer learning, (4 more...)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Synthesis (0.42)
Information Technology > Artificial Intelligence > Machine Learning > Transfer Learning (0.39)

Neural Information Processing Systems Nov-20-2025, 17:03:52 GMT

Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis

Ye Jia, Yu Zhang, Ron Weiss, Quan Wang, Jonathan Shen, Fei Ren, Zhifeng Chen, Patrick Nguyen, Ruoming Pang, Ignacio Lopez Moreno, Yonghui Wu

Neural Information Processing Systems http://nips.cc/

artificial intelligence, machine learning, speech, (19 more...)

Country:

North America > United States (0.04)
North America > Canada > Quebec > Montreal (0.04)

Industry: Information Technology (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Speech > Speech Synthesis (0.88)
Information Technology > Artificial Intelligence > Speech > Acoustic Processing (0.66)

Neural Information Processing Systems Nov-20-2025, 16:03:29 GMT

Neural Voice Cloning with a Few Samples

Sercan Arik, Jitong Chen, Kainan Peng, Wei Ping, Yanqi Zhou

Voice cloning is a highly desired feature for personalized speech interfaces.

artificial intelligence, deep learning, machine learning, (16 more...)

Country:

North America > United States > California > Santa Clara County > Sunnyvale (0.04)
North America > Canada > Quebec > Montreal (0.04)
Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
Asia (0.04)

Industry: Information Technology > Security & Privacy (0.53)

Technology:

Information Technology > Artificial Intelligence > Speech (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

arXiv.org Artificial Intelligence Sep-9-2025

The First Voice Timbre Attribute Detection Challenge

Chen, Liping, He, Jinghao, Sheng, Zhengyan, Lee, Kong Aik, Ling, Zhen-Hua

The first voice timbre attribute detection challenge is featured in a special session at NCMMSC 2025. It focuses on the explainability of voice timbre and compares the intensity of two speech utterances in a specified timbre descriptor dimension. The evaluation was conducted on the VCTK-RVA dataset. Participants developed their systems and submitted their outputs to the organizer, who evaluated the performance and sent feedback to them. Six teams submitted their outputs, with five providing descriptions of their methodologies.

artificial intelligence, descriptor, machine learning, (16 more...)

2509.06635

Country: Asia > China (0.32)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.30)

Singh, Jaskaran, Chowdhury, Amartya Roy, Prabhakar, Raghav, W, Varshul C.

MahaTTS: A Unified Framework for Multilingual Text-to-Speech Synthesis

arXiv.org Artificial Intelligence Aug-21-2025

Current Text-to-Speech models pose a multilingual challenge: most of them have traditionally focused on English and European languages, which limits their potential to provide access to information to many more people. To address this gap, we introduce MahaTTS-v2, a Multilingual Multi-speaker Text-To-Speech (TTS) system that has excellent multilingual expressive capabilities in Indic languages. The model has been trained on around 20K hours of data specifically focused on Indian languages. Our approach leverages Wav2Vec2.0 tokens for semantic extraction, and a Language Model (LM) for text-to-semantic modeling. Additionally, we have used a Conditional Flow Model (CFM) for semantics-to-mel-spectrogram generation. The experimental results indicate the effectiveness of the proposed approach over other frameworks. Our code is available at https://github.com/dubverse-ai/MahaTTSv2

artificial intelligence, arxiv preprint arxiv, natural language, (14 more...)

2508.14049

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Synthesis (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)

Bhadoriya, Ayush Singh, Shinde, Abhishek Nikunj, Pandey, Isha, Ramakrishnan, Ganesh

A2TTS: TTS for Low Resource Indian Languages

arXiv.org Artificial Intelligence Jul-22-2025

We present a speaker-conditioned text-to-speech (TTS) system aimed at addressing challenges in generating speech for unseen speakers and supporting diverse Indian languages. Our method leverages a diffusion-based TTS architecture, where a speaker encoder extracts embeddings from short reference audio samples to condition the DDPM decoder for multispeaker generation. To further enhance prosody and naturalness, we employ a cross-attention-based duration prediction mechanism that utilizes reference audio, enabling more accurate and speaker-consistent timing. This results in speech that closely resembles the target speaker while improving duration modeling and overall expressiveness. Additionally, to improve zero-shot generation, we employed classifier-free guidance, allowing the system to generate speech that is closer to natural speech for unknown speakers. Using this approach, we trained language-specific speaker-conditioned models on the IndicSUPERB dataset for multiple Indian languages, such as Bengali, Gujarati, Hindi, Marathi, Malayalam, Punjabi and Tamil.
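The classifier-free guidance step mentioned in this abstract can be illustrated with the standard combination rule used in diffusion models. This is a generic sketch, not the authors' code: the function name `classifier_free_guidance`, the `scale` parameter and the toy noise values are assumptions made here, and the abstract does not state which guidance scale or parameterization the system uses.

```python
def classifier_free_guidance(eps_cond, eps_uncond, scale):
    """Blend conditional and unconditional noise predictions.

    scale = 0 returns the unconditional prediction, scale = 1 returns the
    conditional one, and scale > 1 extrapolates past the conditional
    prediction, away from the unconditional one.
    """
    return [u + scale * (c - u) for c, u in zip(eps_cond, eps_uncond)]


# Toy noise predictions for one denoising step (hypothetical values).
eps_cond = [0.5, -0.2, 1.0]    # decoder run with the speaker embedding
eps_uncond = [0.1, 0.0, 0.4]   # decoder run with the conditioning dropped
guided = classifier_free_guidance(eps_cond, eps_uncond, scale=2.0)
```

In a full sampler this blended prediction would replace the plain conditional one at every denoising step, so larger scales push the output more strongly toward the reference speaker.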

artificial intelligence, machine learning, speech, (15 more...)

2507.15272

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Speech > Speech Synthesis (0.39)

arXiv.org Artificial Intelligence Jun-24-2025

Introducing voice timbre attribute detection

He, Jinghao, Sheng, Zhengyan, Chen, Liping, Lee, Kong Aik, Ling, Zhen-Hua

This paper focuses on explaining the timbre conveyed by speech signals and introduces a task termed voice timbre attribute detection (vTAD). In this task, voice timbre is explained with a set of sensory attributes describing its human perception. A pair of speech utterances is processed, and their intensity is compared in a designated timbre descriptor. Moreover, a framework is proposed, which is built upon the speaker embeddings extracted from the speech utterances. The investigation is conducted on the VCTK-RVA dataset. Experimental examinations on the ECAPA-TDNN and FACodec speaker encoders demonstrated that: 1) the ECAPA-TDNN speaker encoder was more capable in the seen scenario, where the testing speakers were included in the training set; 2) the FACodec speaker encoder was superior in the unseen scenario, where the testing speakers were not part of the training, indicating enhanced generalization capability. The VCTK-RVA dataset and open-source code are available on the website https://github.com/vTAD2025-Challenge/vTAD.
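The pairwise comparison described in this abstract can be sketched in a few lines. The sketch assumes a single linear scorer applied to the difference of two speaker embeddings, followed by a sigmoid. That choice, the function name `compare_intensity`, and all numeric values are hypothetical stand-ins for illustration, not the framework proposed in the paper.

```python
import math


def compare_intensity(emb_a, emb_b, weights, bias=0.0):
    """Score in (0, 1): modelled probability that utterance B is stronger
    than utterance A in one timbre descriptor (toy linear scorer)."""
    diff = [b - a for a, b in zip(emb_a, emb_b)]
    logit = sum(w * d for w, d in zip(weights, diff)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


# Hypothetical 3-dimensional speaker embeddings and descriptor weights.
emb_a = [0.2, 0.1, -0.3]
emb_b = [0.5, -0.1, 0.4]
weights = [1.0, 0.5, 2.0]

p_ab = compare_intensity(emb_a, emb_b, weights)  # is B stronger than A?
p_ba = compare_intensity(emb_b, emb_a, weights)  # is A stronger than B?
```

With zero bias, scoring the embedding difference makes the two orderings complementary (`p_ab + p_ba` equals 1), which is one simple way to keep pairwise judgments consistent.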

artificial intelligence, descriptor, machine learning, (18 more...)

2505.09661

Country: Asia > China (0.14)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.47)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.46)